Topic modelling with spaCy and scikit-learn

Notebook:

https://www.kaggle.com/thebrownviking20/topic-modelling-with-spacy-and-scikit-learn

Aim and Motivation

Nirant's latest kernel on spaCy, Hitchhiker's Guide to NLP in spaCy, made me realize that spaCy may be as good as, or even better than, NLTK for Natural Language Processing. My recent kernels deal with deep learning, and I want to extend that by using text data for deep learning; I intend to use spaCy for processing and modelling this data.

In [1]:
# Usual imports
import numpy as np
import pandas as pd
from tqdm import tqdm
import string
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
import concurrent.futures
import time
import pyLDAvis.sklearn
from pylab import bone, pcolor, colorbar, plot, show, rcParams, savefig
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
import os
wine_data = os.path.join(os.getcwd(), "data", "input")
print(os.listdir(wine_data))


# Plotly based imports for visualization
import plotly
from plotly import tools
import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
plotly.tools.set_credentials_file(username=os.environ['PLOTLY_USERNAME'], api_key=os.environ['PLOTLY_API_KEY'])


# spaCy based imports
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
['winemag-data-130k-v2.csv', 'winemag-data-130k-v2.json', 'winemag-data_first150k.csv']
In [2]:
# Loading data
wines = pd.read_csv(os.path.join(wine_data, 'winemag-data_first150k.csv'))
wines.head()
Out[2]:
Unnamed: 0 country description designation points price province region_1 region_2 variety winery
0 0 US This tremendous 100% varietal wine hails from ... Martha's Vineyard 96 235.0 California Napa Valley Napa Cabernet Sauvignon Heitz
1 1 Spain Ripe aromas of fig, blackberry and cassis are ... Carodorum Selección Especial Reserva 96 110.0 Northern Spain Toro NaN Tinta de Toro Bodega Carmen Rodríguez
2 2 US Mac Watson honors the memory of a wine once ma... Special Selected Late Harvest 96 90.0 California Knights Valley Sonoma Sauvignon Blanc Macauley
3 3 US This spent 20 months in 30% new French oak, an... Reserve 96 65.0 Oregon Willamette Valley Willamette Valley Pinot Noir Ponzi
4 4 France This is the top wine from La Bégude, named aft... La Brûlade 95 66.0 Provence Bandol NaN Provence red blend Domaine de la Bégude
In [3]:
# Creating a spaCy object
nlp = spacy.load('en_core_web_lg')

spaCy also comes with a built-in named entity visualizer that lets you check your model's predictions in your browser. You can pass in one or more Doc objects and start a web server, export HTML files or view the visualization directly from a Jupyter Notebook.

Named Entity Recognition

Named Entity Recognition is an information extraction task in which named entities in unstructured text are located and classified into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

In [4]:
doc = nlp(wines["description"][3])
spacy.displacy.render(doc, style='ent',jupyter=True)
This spent 20 months DATE in 30% PERCENT new French NORP oak, and incorporates fruit from Ponzi ORG 's Aurora PERSON , Abetina GPE and Madrona vineyards ORG , among others. Aromatic, dense and toasty, it deftly blends aromas and flavors of toast, cigar box, blackberry, black cherry, coffee and graphite. Tannins are polished to a fine sheen, and frame a finish loaded with dark chocolate and espresso. Drink now through 2032 DATE .
In [5]:
## Stopwords 

from IPython.display import Image
import os
Images = os.path.join(os.getcwd(), "Images")
Image(filename=os.path.join(Images, 'StopWords.png'))
Out[5]:
In [6]:
punctuations = string.punctuation
stopwords = list(STOP_WORDS)
In [7]:
print("Number of stop words in spaCy is:", len(stopwords))
Number of stop words in spaCy is: 305

Lemmatization

It is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Words like "ran" and "running" are converted to "run" to avoid having words with similar meanings in our data.

In [8]:
review = " ".join([i.lemma_ for i in doc])
In [9]:
doc = nlp(review)
spacy.displacy.render(doc, style='ent',jupyter=True)
this spend 20 month DATE in 30 % PERCENT new french oak , and incorporate fruit from ponzi 's aurora , abetina and madrona vineyard , among other . aromatic , dense and toasty , -PRON- deftly blend aroma and flavor of toast , cigar box , blackberry , black cherry , coffee and graphite . tannin be polish to a fine sheen , and frame a finish load with dark chocolate and espresso . drink now through 2032 DATE .

The sentence looks quite different now that it has been lemmatized.

Parts of Speech tagging

This is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

In [10]:
# POS tagging
for i in nlp(review):
    print(i,"=>",i.pos_)
this => DET
spend => NOUN
20 => NUM
month => NOUN
in => ADP
30 => NUM
% => NOUN
new => ADJ
french => ADJ
oak => NOUN
, => PUNCT
and => CCONJ
incorporate => VERB
fruit => NOUN
from => ADP
ponzi => NOUN
's => PART
aurora => NOUN
, => PUNCT
abetina => ADJ
and => CCONJ
madrona => NOUN
vineyard => NOUN
, => PUNCT
among => ADP
other => ADJ
. => PUNCT
aromatic => ADJ
, => PUNCT
dense => ADJ
and => CCONJ
toasty => ADJ
, => PUNCT
-PRON- => PUNCT
deftly => ADV
blend => VERB
aroma => NOUN
and => CCONJ
flavor => NOUN
of => ADP
toast => NOUN
, => PUNCT
cigar => NOUN
box => NOUN
, => PUNCT
blackberry => NOUN
, => PUNCT
black => ADJ
cherry => NOUN
, => PUNCT
coffee => NOUN
and => CCONJ
graphite => NOUN
. => PUNCT
tannin => NOUN
be => VERB
polish => NOUN
to => ADP
a => DET
fine => ADJ
sheen => NOUN
, => PUNCT
and => CCONJ
frame => VERB
a => DET
finish => NOUN
load => NOUN
with => ADP
dark => ADJ
chocolate => NOUN
and => CCONJ
espresso => NOUN
. => PUNCT
drink => VERB
now => ADV
through => ADP
2032 => NUM
. => PUNCT
In [11]:
# Parser for reviews
parser = English()
def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    mytokens = " ".join(mytokens)
    return mytokens
In [12]:
tqdm.pandas()
wines["processed_description"] = wines["description"].progress_apply(spacy_tokenizer)
100%|█████████████████████████████████████████████████████████████████████████| 150930/150930 [05:07<00:00, 490.24it/s]

What is topic-modelling?

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words.

The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is. It involves various techniques of dimensionality reduction (mostly non-linear) and unsupervised learning, such as LDA, SVD, autoencoders, etc.

Source: Wikipedia
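The dog/cat intuition above can be sketched end-to-end on a toy corpus; everything here (the corpus and the topic count) is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

# Tiny corpus: two documents about dogs, two about cats
toy_docs = [
    "dog bone dog bark dog",
    "dog bone fetch dog",
    "cat meow cat purr cat",
    "cat meow whisker cat",
]

cv = CountVectorizer()
X = cv.fit_transform(toy_docs)

# Two latent topics; fit_transform returns each document's topic mixture
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(X)

print(doc_topics.shape)        # (4, 2)
print(doc_topics.sum(axis=1))  # each row is a distribution summing to ~1
```

Each row of `doc_topics` is one document's mixture over the two latent topics; with enough signal, the dog documents and the cat documents lean toward different topics.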

In [13]:
# Creating a vectorizer
vectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(wines["processed_description"])
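The `token_pattern` above only keeps runs of three or more letters/hyphens, so one- and two-letter words and pure numbers are dropped before counting. A quick check of the same pattern with `re` (the sample string is made up):

```python
import re

# Same pattern as the CountVectorizer above: a letter or hyphen
# followed by at least two more, i.e. tokens of 3+ characters
pattern = re.compile(r'[a-zA-Z\-][a-zA-Z\-]{2,}')

sample = "a dry oak-aged red of 96 points"
print(pattern.findall(sample))  # ['dry', 'oak-aged', 'red', 'points']
```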
In [14]:
NUM_TOPICS = 10
In [15]:
%%time

# Latent Dirichlet Allocation Model
lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online',verbose=True)
data_lda = lda.fit_transform(data_vectorized)
iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10
Wall time: 10min 33s
In [16]:
%%time

# Non-Negative Matrix Factorization Model
nmf = NMF(n_components=NUM_TOPICS)
data_nmf = nmf.fit_transform(data_vectorized) 
Wall time: 31.2 s
In [17]:
%%time

# Latent Semantic Indexing Model using Truncated SVD
lsi = TruncatedSVD(n_components=NUM_TOPICS)
data_lsi = lsi.fit_transform(data_vectorized)
Wall time: 2.7 s
In [18]:
# Functions for printing keywords for each topic
def selected_topics(model, vectorizer, top_n=10):
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(vectorizer.get_feature_names()[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]]) 
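The `topic.argsort()[:-top_n - 1:-1]` slice selects the indices of the `top_n` largest weights in descending order; a quick check on a toy array:

```python
import numpy as np

weights = np.array([0.1, 0.5, 0.3, 0.05])
top_n = 2

# argsort is ascending; the negative-step slice walks it backwards,
# stopping after top_n entries
top_idx = weights.argsort()[:-top_n - 1:-1]
print(top_idx)  # [1 2] -> the two largest weights, 0.5 then 0.3
```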
In [19]:
# Keywords for topics clustered by Latent Dirichlet Allocation
print("LDA Model:")
selected_topics(lda, vectorizer)
LDA Model:
Topic 0:
[('flavor', 13910.311094472272), ('apple', 12503.60484448797), ('finish', 11589.27884017539), ('white', 11088.053936393517), ('peach', 10843.658325583567), ('fruit', 10629.221342077399), ('citrus', 10171.437361485016), ('wine', 9081.461594533948), ('note', 8274.27321615729), ('pear', 8203.489669157343)]
Topic 1:
[('spin-dry', 15632.972351278295), ('flavor', 13810.152980934701), ('wine', 11734.104416773636), ('acidity', 10631.654622354385), ('fruity', 6076.518678966786), ('good', 5535.3195415950295), ('crisp', 5504.079711976185), ('sweet', 5311.045480729165), ('nice', 5268.502963344812), ('balance', 5154.187032795508)]
Topic 2:
[('flavor', 21225.573792933817), ('cherry', 16123.842481384361), ('blackberry', 14150.855373020668), ('sweet', 11619.621155512392), ('wine', 9606.793063333103), ('chocolate', 8545.138742315943), ('oak', 8450.572577287261), ('tannin', 8033.477806100047), ('rich', 7091.5415339218225), ('black', 6346.908384489133)]
Topic 3:
[('cherry', 12793.10144756084), ('pinot', 9666.919396613823), ('good', 9359.162569295793), ('flavor', 8828.928026695961), ('wine', 8468.987767145074), ('red', 6996.63527215191), ('simple', 6253.090705693131), ('spice', 5798.755174346871), ('raspberry', 5612.663976316225), ('drink', 5117.970719029046)]
Topic 4:
[('black', 12929.692261303904), ('fruit', 12544.83127231677), ('aroma', 11818.972054961), ('cherry', 11728.747988346158), ('spice', 11414.317169037871), ('tannin', 9932.935522553871), ('berry', 9572.468047373419), ('note', 9482.070087333359), ('cabernet', 9016.882955582889), ('wine', 7835.528042936244)]
Topic 5:
[('oak', 11000.553088998866), ('little', 7257.55338841055), ('chardonnay', 6657.860319573845), ('barrel', 5895.746898791742), ('new', 4468.725254847325), ('year', 4462.507165933575), ('pure', 3695.9153394921464), ('shows', 3416.6600730958344), ('fruit', 2950.077180398317), ('smoky', 2608.829190216893)]
Topic 6:
[('flavor', 14710.210700105403), ('finish', 13053.68947331578), ('palate', 12411.629325065494), ('aroma', 11458.116579906378), ('fruit', 9491.910800729083), ('nose', 7741.351071375968), ('berry', 7195.582687959038), ('herbal', 6839.699106991653), ('bite', 6195.900130908386), ('good', 5679.463924729857)]
Topic 7:
[('vineyard', 6735.085589976619), ('blend', 4375.657064794247), ('syrah', 3359.1375742293994), ('grape', 3342.962630592729), ('winery', 3196.9266410611476), ('dusty', 3172.495525422131), ('bottle', 2662.4086811554903), ('load', 2558.466554608588), ('valley', 2271.023191746224), ('vintage', 2136.3608038643224)]
Topic 8:
[('wine', 37676.49345469437), ('fruit', 26402.807559749137), ('tannin', 14923.080103259137), ('year', 11753.45195731563), ('age', 11344.268893496332), ('ripe', 11220.48797581326), ('acidity', 11075.026894796702), ('flavor', 10259.996166211286), ('good', 10143.521091548944), ('structure', 8737.114225790243)]
Topic 9:
[('wine', 10455.70967276106), ('light', 7913.250480499), ('fresh', 5961.474097838814), ('imported', 4463.201278672147), ('touch', 4422.394426917691), ('sauvignon', 4231.826362928134), ('blanc', 3932.335510582616), ('crisp', 3804.816324043661), ('fruit', 3713.524796806457), ('lean', 3441.6917426779623)]
In [20]:
# Keywords for topics clustered by Non-Negative Matrix Factorization
print("NMF Model:")
selected_topics(nmf, vectorizer)
NMF Model:
Topic 0:
[('flavor', 20.81952845224774), ('spin-dry', 2.2459451867874747), ('sweet', 1.7037379638296288), ('oak', 1.6616461555639146), ('vanilla', 1.0847685006851087), ('little', 1.0483342271945022), ('blackberry', 0.9726837433204013), ('finish', 0.9531256186562851), ('like', 0.8693193967876094), ('bite', 0.8677148848374735)]
Topic 1:
[('wine', 15.713752736014552), ('age', 0.9872316169436435), ('year', 0.7136357109960517), ('spice', 0.5784341734285087), ('character', 0.5204075225277413), ('structure', 0.5179454197918056), ('wood', 0.4835812233773158), ('texture', 0.4718759893684443), ('great', 0.4562303936064821), ('like', 0.342218832405483)]
Topic 2:
[('fruit', 16.496642194428624), ('red', 1.2711414203837506), ('wood', 0.6420905068234185), ('tropical', 0.6056650548437337), ('barrel', 0.5738016422463286), ('spice', 0.4918420190143807), ('stone', 0.479365152195368), ('oak', 0.4366380324636827), ('hint', 0.3890565771093833), ('age', 0.3869062248685764)]
Topic 3:
[('tannin', 10.53482442207948), ('black', 8.374080655822787), ('dark', 2.662818450116487), ('firm', 2.399960953373403), ('blackberry', 2.37885929835856), ('currant', 2.2919462560500787), ('year', 2.157331344586757), ('plum', 1.9283684791990738), ('structure', 1.914010197150526), ('age', 1.5236908426017197)]
Topic 4:
[('finish', 8.864847524285242), ('aroma', 7.789235250599788), ('palate', 7.230771392561514), ('note', 4.358105420888774), ('berry', 3.605695418860381), ('nose', 3.3652627328331457), ('plum', 2.1226095861044394), ('spice', 2.0631938871355406), ('offer', 1.475960608573542), ('long', 1.4024328441652911)]
Topic 5:
[('acidity', 10.607851550735798), ('fresh', 4.747457597383805), ('crisp', 4.534297478210633), ('apple', 3.898523344323601), ('citrus', 2.9464921700467888), ('white', 2.9416188156390732), ('peach', 2.4311729598276957), ('green', 2.1381590599035607), ('lemon', 2.0127135623703287), ('pear', 1.965758756490374)]
Topic 6:
[('cherry', 14.457246942695777), ('red', 3.787227728346502), ('raspberry', 3.0002079112118105), ('spice', 2.954751918995478), ('pinot', 2.377364863741714), ('colon', 1.9878869462930433), ('noir', 1.3573455968848906), ('silky', 1.224592702877589), ('oak', 1.010223014469044), ('soft', 0.9514762143155973)]
Topic 7:
[('good', 15.061035435045119), ('balance', 1.685339150120586), ('price', 0.7560717364146413), ('pretty', 0.46718596462054696), ('structure', 0.425514544529992), ('berry', 0.3808208279882524), ('year', 0.36434830722415573), ('nose', 0.34194152753667284), ('oak', 0.33225009757581486), ('integrate', 0.32381585318693407)]
Topic 8:
[('cabernet', 8.547966381746388), ('blend', 8.312503422644433), ('sauvignon', 5.42395727066336), ('merlot', 4.192704413022929), ('franc', 2.2990979350214973), ('blackberry', 2.276555061233735), ('syrah', 2.166254302331351), ('oak', 1.4826857589171305), ('chocolate', 1.3084346633263058), ('verdot', 1.1782718028834778)]
Topic 9:
[('ripe', 10.574798323492786), ('drink', 8.692757111112904), ('rich', 6.203521723948055), ('soft', 3.514871789824879), ('sweet', 3.2048262200760673), ('texture', 1.9006835429096383), ('blackberry', 1.4922579002152123), ('oak', 1.4057508612946463), ('ready', 1.0788971652333808), ('smooth', 0.9582889528504147)]
In [21]:
# Keywords for topics clustered by Latent Semantic Indexing
print("LSI Model:")
selected_topics(lsi, vectorizer)
LSI Model:
Topic 0:
[('wine', 0.46545443759708577), ('flavor', 0.37869474959866267), ('fruit', 0.348779378056952), ('finish', 0.18255713440491594), ('cherry', 0.1771505674010354), ('tannin', 0.15778296955056645), ('aroma', 0.15052435215445692), ('good', 0.14422784147493695), ('acidity', 0.14296701897846634), ('black', 0.1303079932501738)]
Topic 1:
[('wine', 0.6964235791945285), ('fruit', 0.16932650092504276), ('age', 0.0912572225867933), ('acidity', 0.06372731032942196), ('structure', 0.048577800393516196), ('year', 0.048278019016740635), ('wood', 0.0464815266398888), ('character', 0.040819555559468056), ('great', 0.030645638002628334), ('rich', 0.029708856592093342)]
Topic 2:
[('fruit', 0.7154726307587135), ('black', 0.1628907681170831), ('tannin', 0.11289922982540755), ('palate', 0.10284083615575716), ('aroma', 0.09492634935317329), ('note', 0.08377020183975617), ('berry', 0.07031491758661514), ('dark', 0.06660692106307047), ('finish', 0.06488066701944473), ('red', 0.06455231703113201)]
Topic 3:
[('cherry', 0.41569825814369127), ('tannin', 0.33186083428973423), ('black', 0.312672998232449), ('blackberry', 0.18431294275656698), ('cabernet', 0.12064202634848364), ('currant', 0.11281537374857413), ('chocolate', 0.09912794483697324), ('dark', 0.09245462112015895), ('year', 0.09092966801976755), ('spice', 0.08566238417155148)]
Topic 4:
[('aroma', 0.3517438903844236), ('wine', 0.30326073714973495), ('palate', 0.290996615688953), ('finish', 0.23929524941922156), ('note', 0.20026608419114816), ('berry', 0.14176513945589228), ('spice', 0.13781081102370044), ('nose', 0.12664953953251845), ('cherry', 0.09834828045443263), ('offer', 0.09536814514696014)]
Topic 5:
[('acidity', 0.475317077977643), ('drink', 0.273074202505826), ('cherry', 0.21023279775840847), ('note', 0.18709086328270147), ('ripe', 0.16495845105360515), ('crisp', 0.14462541900650386), ('fresh', 0.13873106395094734), ('spin-dry', 0.13062141963777535), ('tannin', 0.1268693919646988), ('good', 0.11704670606647417)]
Topic 6:
[('cherry', 0.5323965272673715), ('red', 0.29015178505434924), ('fruit', 0.21499936954581944), ('raspberry', 0.16232116423179618), ('pinot', 0.14717754656903043), ('light', 0.11190239323880621), ('fresh', 0.1102708005709056), ('spice', 0.1075837739210273), ('colon', 0.08716269504896577), ('bright', 0.085598578416376)]
Topic 7:
[('good', 0.7657325175209471), ('berry', 0.2076086180788654), ('red', 0.1702125868312064), ('aroma', 0.14780812020762862), ('acidity', 0.13283586222303329), ('palate', 0.09492365339407223), ('fresh', 0.09328421246220242), ('balance', 0.08059134397547753), ('plum', 0.06956328870445924), ('tannin', 0.059009559568636925)]
Topic 8:
[('good', 0.488851295368366), ('oak', 0.24761086823921102), ('cherry', 0.18057788128434674), ('sweet', 0.15400591500494454), ('note', 0.15361689541073106), ('finish', 0.11844070362944882), ('spice', 0.11100050074922768), ('vanilla', 0.1102070161539246), ('blend', 0.09747741869782847), ('rich', 0.09685978317906506)]
Topic 9:
[('aroma', 0.36496342423513917), ('ripe', 0.34075993795268544), ('sweet', 0.2653659545077754), ('spice', 0.24780072082207677), ('rich', 0.17561240142123002), ('oak', 0.13573668056629712), ('vanilla', 0.12384007495561028), ('palate', 0.11838166725946682), ('soft', 0.10220543409782581), ('toast', 0.08558601553955775)]
In [22]:
# Transforming an individual sentence
text = spacy_tokenizer("Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.")
x = lda.transform(vectorizer.transform([text]))[0]
print(x)
[0.45736211 0.00500124 0.13553722 0.00500019 0.37209658 0.00500006
 0.00500105 0.00500004 0.00500053 0.00500099]

The index in the above list with the largest value represents the most dominant topic for the given review.
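In code, `np.argmax` picks that dominant topic directly (reusing the distribution printed above):

```python
import numpy as np

# Topic distribution for the transformed review (copied from the output above)
x = np.array([0.45736211, 0.00500124, 0.13553722, 0.00500019, 0.37209658,
              0.00500006, 0.00500105, 0.00500004, 0.00500053, 0.00500099])

print(np.argmax(x))  # 0 -> topic 0 is the most dominant for this review
```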

Visualizing LDA results with pyLDAvis

In [23]:
pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda, data_vectorized, vectorizer, mds='tsne')
dash
Out[23]:

How to interpret this graph?

1. Topics are on the left, while their respective keywords are on the right.
2. Larger topics are more frequent; the closer two topics are, the more similar they are.
3. Keywords are selected based on their frequency and discriminative power.

Hover over the topics on the left to get information about their keywords on the right.

Visualizing LSI(SVD) scatterplot

We will visualize the data reduced to 2 components to see the similarity between keywords, which is indicated by the distance between markers.

In [24]:
svd_2d = TruncatedSVD(n_components=2)
data_2d = svd_2d.fit_transform(data_vectorized)
In [25]:
# Plotly based imports for visualization
import plotly
from plotly import tools
import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
plotly.tools.set_credentials_file(username=os.environ['PLOTLY_USERNAME'], api_key=os.environ['PLOTLY_API_KEY'])
In [26]:
trace = go.Scattergl(
    x = data_2d[:,0],
    y = data_2d[:,1],
    mode = 'markers',
    marker = dict(
        color = '#FFBAD2',
        line = dict(width = 1)
    ),
    # Note: data_2d rows are documents, while get_feature_names() returns
    # vocabulary terms, so these hover labels do not line up with the points;
    # to plot terms instead, fit the SVD on data_vectorized.T.
    text = vectorizer.get_feature_names(),
    hovertext = vectorizer.get_feature_names(),
    hoverinfo = 'text'
)
data = [trace]
iplot(data, filename='scatter-mode')

The text version of the scatter plot looks messy, but you can zoom in for a closer look.

In [27]:
trace = go.Scattergl(
    x = data_2d[:,0],
    y = data_2d[:,1],
    mode = 'text',
    marker = dict(
        color = '#FFBAD2',
        line = dict(width = 1)
    ),
    text = vectorizer.get_feature_names()
)
data = [trace]
iplot(data, filename='text-scatter-mode')

Let's see what happens when we use a spaCy-based bigram tokenizer for topic modelling.

In [28]:
def spacy_bigram_tokenizer(phrase):
    doc = nlp(phrase)  # use the full pipeline: the bare tokenizer (parser) does not assign POS tags
    token_not_noun = []
    notnoun_noun_list = []
    noun = ""

    for item in doc:
        if item.pos_ != "NOUN":  # separate nouns from non-nouns
            token_not_noun.append(item.text)
        else:
            noun = item.text  # keep the last noun seen

    # Pair each non-noun token with the last noun in the phrase
    for notnoun in token_not_noun:
        notnoun_noun_list.append(notnoun + " " + noun)

    return " ".join(notnoun_noun_list)
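The pairing logic is easier to see without spaCy; a minimal sketch on hand-tagged (text, POS) tuples, which are made up for illustration:

```python
# Hand-tagged tokens standing in for a spaCy Doc (hypothetical example)
tagged = [("ripe", "ADJ"), ("red", "ADJ"), ("cherry", "NOUN")]

token_not_noun = [text for text, pos in tagged if pos != "NOUN"]
nouns = [text for text, pos in tagged if pos == "NOUN"]
noun = nouns[-1] if nouns else ""

# Each non-noun token gets paired with the last noun in the phrase
bigrams = [t + " " + noun for t in token_not_noun]
print(" ".join(bigrams))  # ripe cherry red cherry
```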
In [29]:
# Note: ngram_range=(1,2) builds the bigrams here; the spacy_bigram_tokenizer
# defined above is not passed to the vectorizer.
bivectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, ngram_range=(1,2))
bigram_vectorized = bivectorizer.fit_transform(wines["processed_description"])

LDA for bigram data

In [30]:
bi_lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online',verbose=True)
data_bi_lda = bi_lda.fit_transform(bigram_vectorized)
iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10

Topics for bigram model

In [31]:
print("Bi-LDA Model:")
selected_topics(bi_lda, bivectorizer)
Bi-LDA Model:
Topic 0:
[('aroma', 8171.939114567469), ('spice', 5809.372571449917), ('offer', 5331.169147038662), ('cherry', 4350.678353529048), ('note', 3607.7317295601993), ('open', 3076.261797879477), ('berry', 2876.2881402200137), ('close', 2599.526877954569), ('wine', 2557.8525887518463), ('deliver', 2555.2060764423277)]
Topic 1:
[('dry', 14297.620689602543), ('crisp', 13728.049126273025), ('acidity', 12764.28572779791), ('wine', 11967.411192831361), ('flavor', 11491.437014572175), ('fresh', 11375.161327465637), ('citrus', 11245.942104056649), ('spin', 11236.000061966615), ('spin dry', 11105.957606335078), ('white', 10285.785013975243)]
Topic 2:
[('berry', 16447.15814347084), ('flavor', 11793.595809772707), ('palate', 11610.35546401836), ('fruit', 11603.643944363883), ('finish', 11589.807150406763), ('aroma', 11542.877887689086), ('plum', 11348.854561666963), ('red', 9619.187697616971), ('cherry', 7174.890042967764), ('nose', 6943.665781579269)]
Topic 3:
[('wine', 34993.84730259047), ('fruit', 24672.702859197616), ('age', 13068.276416710705), ('acidity', 12607.660525266143), ('tannin', 12439.292903893094), ('ripe', 11409.495836813861), ('year', 9140.763846708842), ('rich', 8982.674540040472), ('structure', 8698.941986399892), ('flavor', 8100.392433254124)]
Topic 4:
[('flavor', 19142.221509754756), ('cherry', 16204.911370320202), ('little', 10016.989875755235), ('pinot', 8828.466104940297), ('oak', 8220.467739974903), ('wine', 8062.787661711842), ('dry', 7282.527465241971), ('raspberry', 7206.310929921604), ('drink', 6831.655436469686), ('spin', 6046.278501292608)]
Topic 5:
[('wine', 20183.08467060814), ('sweet', 13781.761659346064), ('soft', 10429.3333926482), ('fruit', 9850.633953821034), ('ripe', 7383.906415032424), ('rich', 7053.032118272016), ('touch', 6361.30003993727), ('spice', 6270.921552790501), ('mouth', 5348.428918742546), ('smooth', 4384.050103663491)]
Topic 6:
[('flavor', 19002.30913546758), ('palate', 9846.143921657142), ('aroma', 9020.404272643702), ('finish', 8531.592899903097), ('nose', 7183.417994512016), ('chardonnay', 6340.184313097824), ('green', 6179.099633122222), ('vanilla', 5945.662277298789), ('fruit', 5209.155038952789), ('pineapple', 5205.251257544348)]
Topic 7:
[('grape', 4277.193154560114), ('blend', 4081.4159122773813), ('wine', 3708.643391534127), ('vineyard', 3292.5121730489), ('valley', 2722.965700681968), ('vintage', 2687.4778628964204), ('winery', 2634.0460619951036), ('strawberry', 2192.1291704990786), ('floral', 1989.215556975629), ('variety', 1886.1583965979871)]
Topic 8:
[('fruit', 10562.749868338547), ('finish', 9919.740776209628), ('wine', 7979.824109109332), ('good', 6649.608637075669), ('medium', 6417.041252128187), ('bodied', 5768.706154001304), ('spicy', 5365.572140366616), ('slightly', 4702.535714516787), ('flavor', 4264.723504826806), ('note', 4002.19462488706)]
Topic 9:
[('blackberry', 18568.078819995935), ('black', 15583.692357280552), ('tannin', 14367.114942853732), ('cherry', 12946.548716445348), ('cabernet', 11655.043799462848), ('flavor', 10659.65429310907), ('dry', 9649.740200722787), ('chocolate', 8738.495697207223), ('blend', 8444.66077989525), ('currant', 8150.809066488213)]
In [32]:
bi_dash = pyLDAvis.sklearn.prepare(bi_lda, bigram_vectorized, bivectorizer, mds='tsne')
bi_dash
Out[32]:

Results

Very few two-word keywords were found, e.g. "spin dry", "black cherry", etc.
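Those two-word features come from `ngram_range=(1, 2)` in the CountVectorizer; a self-contained sketch on a made-up toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration)
toy_docs = [
    "black cherry tannin",
    "black cherry oak",
    "crisp dry white",
]

cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(toy_docs)

# get_feature_names() was renamed to get_feature_names_out() in newer scikit-learn
names = (cv.get_feature_names_out() if hasattr(cv, "get_feature_names_out")
         else cv.get_feature_names())

# Unigrams and bigrams live side by side in the vocabulary
bigrams = [f for f in names if " " in f]
print(bigrams)  # includes 'black cherry' among the bigram features
```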

In [55]:
#environment and package versions
print('\n')
print("_"*70)
print('The environment and package versions used in this script are:')
print('\n')

import platform
import sys
import bs4
from bs4 import BeautifulSoup
import urllib
import re
import textacy
import spacy
import gensim
import sklearn
import scipy
import matplotlib
import cufflinks as cf
import IPython
import mglearn

print(platform.platform())
print('Python', sys.version)
print("pandas version:", pd.__version__)
print('OS', os.name)
print('Numpy', np.__version__)
print('Beautiful Soup', bs4.__version__)
print('Urllib', urllib.request.__version__) 
print('Regex', re.__version__)
print('Textacy', textacy.__version__)
print('spaCy', spacy.__version__)
print('gensim', gensim.__version__)
print('scikit-learn version', sklearn.__version__)
print('scipy', scipy.__version__)
print('matplotlib', matplotlib.__version__)
print('plotly', plotly.__version__)
print('Cufflinks', cf.__version__)
print("IPython version:", IPython.__version__)
print("mglearn version:", mglearn.__version__)
print("Anaconda Python Environment is: ", os.environ['CONDA_DEFAULT_ENV'])

print('\n')
print("~"*70)
print('\n')

______________________________________________________________________
The environment and package versions used in this script are:


Windows-10-10.0.17134-SP0
Python 3.6.6 |Anaconda custom (64-bit)| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]
pandas version: 0.23.4
OS nt
Numpy 1.15.4
Beautiful Soup 4.6.3
Urllib 3.6
Regex 2.2.1
Textacy 0.6.2
spaCy 2.0.12
gensim 3.5.0
scikit-learn version 0.19.2
scipy 1.1.0
matplotlib 2.2.2
plotly 3.4.2
Cufflinks 0.12.1
IPython version: 6.5.0
mglearn version: 0.1.7
Anaconda Python Environment is:  py36_text_analytics


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


In [ ]: